Goto

Collaborating Authors

 forward selection




Automated Model Selection for Tabular Data

arXiv.org Artificial Intelligence

Structured data in the form of tabular datasets contain features that are distinct and discrete, with varying individual and relative importances to the target. Combinations of one or more features may be more predictive and meaningful than simple individual feature contributions. R's mixed effect linear models library allows users to provide such interactive feature combinations in the model design. However, given many features and possible interactions to select from, model selection becomes an exponentially difficult task. We aim to automate the model selection process for predictions on tabular datasets incorporating feature interactions while keeping computational costs small. The framework includes two distinct approaches for feature selection: a Priority-based Random Grid Search and a Greedy Search method. The Priority-based approach efficiently explores feature combinations using prior probabilities to guide the search. The Greedy method builds the solution iteratively by adding or removing features based on their impact. Experiments on synthetic demonstrate the ability to effectively capture predictive feature combinations.


Predicting Failure of P2P Lending Platforms through Machine Learning: The Case in China

arXiv.org Artificial Intelligence

This study employs machine learning models to predict the failure of Peer-to-Peer (P2P) lending platforms, specifically in China. By employing the filter method and wrapper method with forward selection and backward elimination, we establish a rigorous and practical procedure that ensures the robustness and importance of variables in predicting platform failures. The research identifies a set of robust variables that consistently appear in the feature subsets across different selection methods and models, suggesting their reliability and relevance in predicting platform failures. The study highlights that reducing the number of variables in the feature subset leads to an increase in the false acceptance rate while the performance metrics remain stable, with an AUC value of approximately 0.96 and an F1 score of around 0.88. The findings of this research provide significant practical implications for regulatory authorities and investors operating in the Chinese P2P lending industry.


Beam Search for Feature Selection

arXiv.org Machine Learning

In this paper, we present and prove some consistency results about the performance of classification models using a subset of features. In addition, we propose to use beam search to perform feature selection, which can be viewed as a generalization of forward selection. We apply beam search to both simulated and real-world data, by evaluating and comparing the performance of different classification models using different sets of features. The results demonstrate that beam search could outperform forward selection, especially when the features are correlated so that they have more discriminative power when considered jointly than individually. Moreover, in some cases classification models could obtain comparable performance using only ten features selected by beam search instead of hundreds of original features.


Data Summarization via Bilevel Optimization

arXiv.org Machine Learning

The increasing availability of massive data sets poses a series of challenges for machine learning. Prominent among these is the need to learn models under hardware or human resource constraints. In such resource-constrained settings, a simple yet powerful approach is to operate on small subsets of the data. Coresets are weighted subsets of the data that provide approximation guarantees for the optimization objective. However, existing coreset constructions are highly model-specific and are limited to simple models such as linear regression, logistic regression, and $k$-means. In this work, we propose a generic coreset construction framework that formulates the coreset selection as a cardinality-constrained bilevel optimization problem. In contrast to existing approaches, our framework does not require model-specific adaptations and applies to any twice differentiable model, including neural networks. We show the effectiveness of our framework for a wide range of models in various settings, including training non-convex models online and batch active learning.


Feature Selection using Wrapper Method with Python Implementation

#artificialintelligence

In today's era of Big data and IoT, we are easily loaded with rich datasets having extremely high dimensions. In order to perform any machine learning task or to get insights from such high dimensional data, feature selection becomes very important. Increase in complexity of a model and makes it harder to interpret. Increase in time complexity for a model to get trained. Hence, it gives an indispensable need to perform feature selection.


ERFit: Entropic Regression Fit Matlab Package, for Data-Driven System Identification of Underlying Dynamic Equations

arXiv.org Machine Learning

Data-driven sparse system identification becomes the general framework for a wide range of problems in science and engineering. It is a problem of growing importance in applied machine learning and artificial intelligence algorithms. In this work, we developed the Entropic Regression Software Package (ERFit), a MATLAB package for sparse system identification using the entropic regression method. The code requires minimal supervision, with a wide range of options that make it adapt easily to different problems in science and engineering. The ERFit is available at https://github.com/almomaa/ERFit-Package


The energy distance for ensemble and scenario reduction

arXiv.org Machine Learning

Scenario reduction techniques are widely applied for solving sophisticated dynamic and stochastic programs, especially in energy and power systems, but also used in probabilistic forecasting, clustering and estimating generative adversarial networks (GANs). We propose a new method for ensemble and scenario reduction based on the energy distance which is a special case of the maximum mean discrepancy (MMD). We discuss the choice of energy distance in detail, especially in comparison to the popular Wasserstein distance which is dominating the scenario reduction literature. The energy distance is a metric between probability measures that allows for powerful tests for equality of arbitrary multivariate distributions or independence. Thanks to the latter, it is a suitable candidate for ensemble and scenario reduction problems. The theoretical properties and considered examples indicate clearly that the reduced scenario sets tend to exhibit better statistical properties for the energy distance than a corresponding reduction with respect to the Wasserstein distance. We show applications to a Bernoulli random walk and two real data based examples for electricity demand profiles and day-ahead electricity prices.


Feature Selection in Machine Learning

#artificialintelligence

In the real world, data is not as clean as it's often assumed to be. That's where all the data mining and wrangling comes in; to build insights out of the data that has been structured using queries, and now probably contains certain missing values, and exhibits possible patterns that are unseen to the naked eye. That's where Machine Learning comes in: To check for patterns and make use of those patterns to predict outcomes using these newly understood relationships in the data. For one to understand the depth of the algorithm, one needs to read through the variables in the data, and what those variables represent. Understanding this is important, because when you need to prove your outcomes, based on your understanding of the data.